Data Analysis & Visualization - Michael Cheng¶

Project Problem Statement - Auto-mpg Analysis¶

Background¶

Context

Shifting market conditions, globalization, cost pressure, and volatility are changing the automobile market landscape. The growing availability of data, combined with machine learning, is helping automobile companies drive operational and business transformations.

The automobile market is vast and diverse, with numerous vehicle categories manufactured and sold in varying configurations of attributes such as displacement, horsepower, and acceleration. Through this analysis, we aim to find combinations of these features that clearly distinguish certain groups of automobiles from others, as this will inform downstream processes for any organization aiming to sell each group of vehicles to a slightly different target audience.

You are a Data Scientist at SecondLife, a leading used-car dealership with numerous outlets across the US. Recently, the company has started shifting its focus to vintage cars and has been diligently collecting data about all the vintage cars it has sold over the years. The Director of Operations at SecondLife wants to leverage the data to extract insights about the cars and find different groups of vintage cars in order to target the audience more efficiently. [1]

Objective

The objective of this problem is to explore the data, extract meaningful insights, and find different groups of vehicles in the data by using dimensionality reduction techniques like PCA and t-SNE.

Data Description: There are 8 variables in the dataset:

  • mpg: miles per gallon

  • cyl: number of cylinders

  • disp: engine displacement (cu. inches) or engine size

  • hp: horsepower

  • wt: vehicle weight (lbs.)

  • acc: time taken to accelerate from 0 to 60 mph (sec.)

  • yr: model year

  • car name: car model name

Import Libraries & Load the Data¶

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Importing PCA and t-SNE
from sklearn.decomposition import PCA

from sklearn.manifold import TSNE

# Summary Tools
from summarytools import dfSummary
data = pd.read_csv(r"C:\Users\mikecbos\Downloads\MIT_Elective-AutoMPG\auto-mpg.csv")

Data Preprocessing¶

In [2]:
# Copy of data
df = data.copy()

# Overview of data
print(df.head())
df.info()
dfSummary(df)
    mpg  cylinders  displacement horsepower  weight  acceleration  model year  \
0  18.0          8         307.0        130    3504          12.0          70   
1  15.0          8         350.0        165    3693          11.5          70   
2  18.0          8         318.0        150    3436          11.0          70   
3  16.0          8         304.0        150    3433          12.0          70   
4  17.0          8         302.0        140    3449          10.5          70   

                    car name  
0  chevrolet chevelle malibu  
1          buick skylark 320  
2         plymouth satellite  
3              amc rebel sst  
4                ford torino  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    object 
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model year    398 non-null    int64  
 7   car name      398 non-null    object 
dtypes: float64(3), int64(3), object(2)
memory usage: 25.0+ KB
Out[2]:
Data Frame Summary
df
Dimensions: 398 x 8
Duplicates: 0

1. mpg [float64]: Mean (sd): 23.5 (7.8); min < med < max: 9.0 < 23.0 < 46.6; IQR (CV): 11.5 (3.0); 129 distinct values; 0 (0.0%) missing
2. cylinders [int64]: Mean (sd): 5.5 (1.7); min < med < max: 3.0 < 4.0 < 8.0; IQR (CV): 4.0 (3.2); 5 distinct values; 0 (0.0%) missing
3. displacement [float64]: Mean (sd): 193.4 (104.3); min < med < max: 68.0 < 148.5 < 455.0; IQR (CV): 157.8 (1.9); 82 distinct values; 0 (0.0%) missing
4. horsepower [object]: top values: 150: 22 (5.5%), 90: 20 (5.0%), 88: 19 (4.8%), 110: 18 (4.5%), 100: 17 (4.3%), 75: 14 (3.5%), 95: 14 (3.5%), 105: 12 (3.0%), 70: 12 (3.0%), 67: 12 (3.0%), other: 238 (59.8%); 0 (0.0%) missing
5. weight [int64]: Mean (sd): 2970.4 (846.8); min < med < max: 1613.0 < 2803.5 < 5140.0; IQR (CV): 1384.2 (3.5); 351 distinct values; 0 (0.0%) missing
6. acceleration [float64]: Mean (sd): 15.6 (2.8); min < med < max: 8.0 < 15.5 < 24.8; IQR (CV): 3.3 (5.6); 95 distinct values; 0 (0.0%) missing
7. model year [int64]: Mean (sd): 76.0 (3.7); min < med < max: 70.0 < 76.0 < 82.0; IQR (CV): 6.0 (20.6); 13 distinct values; 0 (0.0%) missing
8. car name [object]: top values: ford pinto: 6 (1.5%), toyota corolla: 5 (1.3%), amc matador: 5 (1.3%), ford maverick: 5 (1.3%), chevrolet chevette: 4 (1.0%), amc gremlin: 4 (1.0%), chevrolet impala: 4 (1.0%), peugeot 504: 4 (1.0%), amc hornet: 4 (1.0%), toyota corona: 4 (1.0%), other: 353 (88.7%); 0 (0.0%) missing

(inline distribution graphs omitted from this text export)

The overview above shows:¶

  1. There are no duplicates or missing values.
  2. The average model year is 1976, with a relatively low spread (3.7 years).
  3. The minimum year is 1970, the median is 1976, and the maximum is 1982, showing a symmetric spread around the median.
  4. The Inter-Quartile Range (IQR) is 6.0, meaning the middle 50% of model years span 6 years (i.e. roughly 1973 to 1979, depending on the exact quartiles).
  5. The reported Coefficient of Variation (CV) of 20.6 appears to be the mean-to-sd ratio (76.0 / 3.7 ≈ 20.5) rather than sd/mean; the standard deviation is only about 5% of the mean, indicating fairly low relative variability in model years.
  6. The dataset captures each vehicle's performance features:
  • mpg
  • cylinders
  • displacement
  • horsepower
  • weight
  • acceleration
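The IQR and CV figures are worth verifying by hand. A minimal check on a toy stand-in for the 'model year' column (not the notebook's df, which spans 1970-1982 with 398 rows) shows that the conventional sd/mean CV comes out far smaller than the 20.6 reported by dfSummary, suggesting the tool reports a different ratio:

```python
import pandas as pd

# Toy stand-in for the 'model year' column (real data spans 70-82)
yr = pd.Series([70, 72, 74, 76, 78, 80, 82])

q1, q3 = yr.quantile(0.25), yr.quantile(0.75)
iqr = q3 - q1                       # spread of the middle 50% of model years
cv = yr.std() / yr.mean() * 100     # conventional CV: sd as a % of the mean

print(f"IQR: {iqr}, CV: {cv:.1f}%")
```

Against the full column, df['model year'].std() / df['model year'].mean() works out to roughly 3.7 / 76 ≈ 5%.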
In [3]:
# Objects Columns: Review 'car name'
df['car name'].unique()
Out[3]:
array(['chevrolet chevelle malibu', 'buick skylark 320',
       'plymouth satellite', 'amc rebel sst', 'ford torino',
       'ford galaxie 500', 'chevrolet impala', 'plymouth fury iii',
       'pontiac catalina', 'amc ambassador dpl', 'dodge challenger se',
       "plymouth 'cuda 340", 'chevrolet monte carlo',
       'buick estate wagon (sw)', 'toyota corona mark ii',
       'plymouth duster', 'amc hornet', 'ford maverick', 'datsun pl510',
       'volkswagen 1131 deluxe sedan', 'peugeot 504', 'audi 100 ls',
       'saab 99e', 'bmw 2002', 'amc gremlin', 'ford f250', 'chevy c20',
       'dodge d200', 'hi 1200d', 'chevrolet vega 2300', 'toyota corona',
       'ford pinto', 'plymouth satellite custom', 'ford torino 500',
       'amc matador', 'pontiac catalina brougham', 'dodge monaco (sw)',
       'ford country squire (sw)', 'pontiac safari (sw)',
       'amc hornet sportabout (sw)', 'chevrolet vega (sw)',
       'pontiac firebird', 'ford mustang', 'mercury capri 2000',
       'opel 1900', 'peugeot 304', 'fiat 124b', 'toyota corolla 1200',
       'datsun 1200', 'volkswagen model 111', 'plymouth cricket',
       'toyota corona hardtop', 'dodge colt hardtop', 'volkswagen type 3',
       'chevrolet vega', 'ford pinto runabout', 'amc ambassador sst',
       'mercury marquis', 'buick lesabre custom',
       'oldsmobile delta 88 royale', 'chrysler newport royal',
       'mazda rx2 coupe', 'amc matador (sw)',
       'chevrolet chevelle concours (sw)', 'ford gran torino (sw)',
       'plymouth satellite custom (sw)', 'volvo 145e (sw)',
       'volkswagen 411 (sw)', 'peugeot 504 (sw)', 'renault 12 (sw)',
       'ford pinto (sw)', 'datsun 510 (sw)',
       'toyouta corona mark ii (sw)', 'dodge colt (sw)',
       'toyota corolla 1600 (sw)', 'buick century 350',
       'chevrolet malibu', 'ford gran torino', 'dodge coronet custom',
       'mercury marquis brougham', 'chevrolet caprice classic',
       'ford ltd', 'plymouth fury gran sedan',
       'chrysler new yorker brougham', 'buick electra 225 custom',
       'amc ambassador brougham', 'plymouth valiant',
       'chevrolet nova custom', 'volkswagen super beetle', 'ford country',
       'plymouth custom suburb', 'oldsmobile vista cruiser',
       'toyota carina', 'datsun 610', 'maxda rx3', 'mercury capri v6',
       'fiat 124 sport coupe', 'chevrolet monte carlo s',
       'pontiac grand prix', 'fiat 128', 'opel manta', 'audi 100ls',
       'volvo 144ea', 'dodge dart custom', 'saab 99le', 'toyota mark ii',
       'oldsmobile omega', 'chevrolet nova', 'datsun b210',
       'chevrolet chevelle malibu classic', 'plymouth satellite sebring',
       'buick century luxus (sw)', 'dodge coronet custom (sw)',
       'audi fox', 'volkswagen dasher', 'datsun 710', 'dodge colt',
       'fiat 124 tc', 'honda civic', 'subaru', 'fiat x1.9',
       'plymouth valiant custom', 'mercury monarch', 'chevrolet bel air',
       'plymouth grand fury', 'buick century',
       'chevroelt chevelle malibu', 'plymouth fury', 'buick skyhawk',
       'chevrolet monza 2+2', 'ford mustang ii', 'toyota corolla',
       'pontiac astro', 'volkswagen rabbit', 'amc pacer', 'volvo 244dl',
       'honda civic cvcc', 'fiat 131', 'capri ii', 'renault 12tl',
       'dodge coronet brougham', 'chevrolet chevette', 'chevrolet woody',
       'vw rabbit', 'dodge aspen se', 'ford granada ghia',
       'pontiac ventura sj', 'amc pacer d/l', 'datsun b-210', 'volvo 245',
       'plymouth volare premier v8', 'mercedes-benz 280s',
       'cadillac seville', 'chevy c10', 'ford f108', 'dodge d100',
       'honda accord cvcc', 'buick opel isuzu deluxe', 'renault 5 gtl',
       'plymouth arrow gs', 'datsun f-10 hatchback',
       'oldsmobile cutlass supreme', 'dodge monaco brougham',
       'mercury cougar brougham', 'chevrolet concours', 'buick skylark',
       'plymouth volare custom', 'ford granada', 'pontiac grand prix lj',
       'chevrolet monte carlo landau', 'chrysler cordoba',
       'ford thunderbird', 'volkswagen rabbit custom',
       'pontiac sunbird coupe', 'toyota corolla liftback',
       'ford mustang ii 2+2', 'dodge colt m/m', 'subaru dl', 'datsun 810',
       'bmw 320i', 'mazda rx-4', 'volkswagen rabbit custom diesel',
       'ford fiesta', 'mazda glc deluxe', 'datsun b210 gx',
       'oldsmobile cutlass salon brougham', 'dodge diplomat',
       'mercury monarch ghia', 'pontiac phoenix lj',
       'ford fairmont (auto)', 'ford fairmont (man)', 'plymouth volare',
       'amc concord', 'buick century special', 'mercury zephyr',
       'dodge aspen', 'amc concord d/l',
       'buick regal sport coupe (turbo)', 'ford futura',
       'dodge magnum xe', 'datsun 510', 'dodge omni',
       'toyota celica gt liftback', 'plymouth sapporo',
       'oldsmobile starfire sx', 'datsun 200-sx', 'audi 5000',
       'volvo 264gl', 'saab 99gle', 'peugeot 604sl',
       'volkswagen scirocco', 'honda accord lx', 'pontiac lemans v6',
       'mercury zephyr 6', 'ford fairmont 4', 'amc concord dl 6',
       'dodge aspen 6', 'ford ltd landau', 'mercury grand marquis',
       'dodge st. regis', 'chevrolet malibu classic (sw)',
       'chrysler lebaron town @ country (sw)', 'vw rabbit custom',
       'maxda glc deluxe', 'dodge colt hatchback custom', 'amc spirit dl',
       'mercedes benz 300d', 'cadillac eldorado', 'plymouth horizon',
       'plymouth horizon tc3', 'datsun 210', 'fiat strada custom',
       'buick skylark limited', 'chevrolet citation',
       'oldsmobile omega brougham', 'pontiac phoenix',
       'toyota corolla tercel', 'datsun 310', 'ford fairmont',
       'audi 4000', 'toyota corona liftback', 'mazda 626',
       'datsun 510 hatchback', 'mazda glc', 'vw rabbit c (diesel)',
       'vw dasher (diesel)', 'audi 5000s (diesel)', 'mercedes-benz 240d',
       'honda civic 1500 gl', 'renault lecar deluxe', 'vokswagen rabbit',
       'datsun 280-zx', 'mazda rx-7 gs', 'triumph tr7 coupe',
       'ford mustang cobra', 'honda accord', 'plymouth reliant',
       'dodge aries wagon (sw)', 'toyota starlet', 'plymouth champ',
       'honda civic 1300', 'datsun 210 mpg', 'toyota tercel',
       'mazda glc 4', 'plymouth horizon 4', 'ford escort 4w',
       'ford escort 2h', 'volkswagen jetta', 'renault 18i',
       'honda prelude', 'datsun 200sx', 'peugeot 505s turbo diesel',
       'volvo diesel', 'toyota cressida', 'datsun 810 maxima',
       'oldsmobile cutlass ls', 'ford granada gl',
       'chrysler lebaron salon', 'chevrolet cavalier',
       'chevrolet cavalier wagon', 'chevrolet cavalier 2-door',
       'pontiac j2000 se hatchback', 'dodge aries se',
       'ford fairmont futura', 'amc concord dl', 'volkswagen rabbit l',
       'mazda glc custom l', 'mazda glc custom', 'plymouth horizon miser',
       'mercury lynx l', 'nissan stanza xe', 'honda civic (auto)',
       'datsun 310 gx', 'buick century limited',
       'oldsmobile cutlass ciera (diesel)', 'chrysler lebaron medallion',
       'ford granada l', 'toyota celica gt', 'dodge charger 2.2',
       'chevrolet camaro', 'ford mustang gl', 'vw pickup',
       'dodge rampage', 'ford ranger', 'chevy s-10'], dtype=object)

Decision Point:¶

  • Because the firm has "recently started" [1] shifting its attention to vintage vehicles, this dataset will contain both vintage and non-vintage vehicles.
  • Although "vintage" may strictly refer to a particular model year rather than a particular make/model, cultural perception also plays a significant role in a vehicle's vintage value.
  • Since the firm wants to "leverage the data to extract insights about the cars and find different groups of vintage cars to target the audience more efficiently" [1], 'car name' will be considered along with the other features.
In [4]:
# Objects Columns: Review 'horsepower' 
df['horsepower'].unique()
Out[4]:
array(['130', '165', '150', '140', '198', '220', '215', '225', '190',
       '170', '160', '95', '97', '85', '88', '46', '87', '90', '113',
       '200', '210', '193', '?', '100', '105', '175', '153', '180', '110',
       '72', '86', '70', '76', '65', '69', '60', '80', '54', '208', '155',
       '112', '92', '145', '137', '158', '167', '94', '107', '230', '49',
       '75', '91', '122', '67', '83', '78', '52', '61', '93', '148',
       '129', '96', '71', '98', '115', '53', '81', '79', '120', '152',
       '102', '108', '68', '58', '149', '89', '63', '48', '66', '139',
       '103', '125', '133', '138', '135', '142', '77', '62', '132', '84',
       '64', '74', '116', '82'], dtype=object)
In [5]:
# Count occurrences of the undefined value "?"
print("Instances of ? in 'horsepower'")
df['horsepower'].value_counts()['?']
Instances of ? in 'horsepower'
Out[5]:
6

Decision Point:¶

  • 'horsepower' should be converted to a numeric data type for meaningful analysis and visualization
  • For the "?" values, use regression (which generalizes from global patterns) rather than KNN, which is more sensitive to local variability
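For contrast, the KNN route ruled out above could be sketched with scikit-learn's KNNImputer, which fills a missing value with the mean of the k most similar rows. This is a toy frame, not the notebook's df:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame with one missing horsepower value (illustrative only)
toy = pd.DataFrame({
    'displacement': [307.0, 350.0, 97.0, 98.0],
    'weight':       [3504, 3693, 2130, 2125],
    'horsepower':   [130.0, 165.0, np.nan, 70.0],
})

# KNNImputer replaces NaN with the mean horsepower of the k nearest rows,
# measured by (nan-aware) Euclidean distance on the other features
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled['horsepower'].tolist())
```

Here the two nearest neighbors of the incomplete row are the light row 3 and the heavy row 0, so the fill is (70 + 130) / 2 = 100. That kind of local averaging can be swayed by atypical neighbors, which is the variability concern behind choosing regression instead.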
In [6]:
# Use Regression to predict "?" horsepower values

from sklearn.linear_model import LinearRegression

# Replace '?' with NaN and convert to numeric
df['horsepower'] = pd.to_numeric(df['horsepower'].replace('?', np.nan), errors='coerce')

# Split rows with and without missing horsepower
df_missing_hp = df[df['horsepower'].isna()]
df_non_missing_hp = df[~df['horsepower'].isna()]

# Features and target for non-missing rows
X = df_non_missing_hp[['mpg', 'cylinders', 'displacement', 'weight', 'acceleration', 'model year']]
y = df_non_missing_hp['horsepower']

# Train a regression model
model = LinearRegression()
model.fit(X, y)

# Predict missing horsepower
X_missing = df_missing_hp[['mpg', 'cylinders', 'displacement', 'weight', 'acceleration', 'model year']]
df.loc[df['horsepower'].isna(), 'horsepower'] = model.predict(X_missing)

# Display updated DataFrame
df
Out[6]:
mpg cylinders displacement horsepower weight acceleration model year car name
0 18.0 8 307.0 130.0 3504 12.0 70 chevrolet chevelle malibu
1 15.0 8 350.0 165.0 3693 11.5 70 buick skylark 320
2 18.0 8 318.0 150.0 3436 11.0 70 plymouth satellite
3 16.0 8 304.0 150.0 3433 12.0 70 amc rebel sst
4 17.0 8 302.0 140.0 3449 10.5 70 ford torino
... ... ... ... ... ... ... ... ...
393 27.0 4 140.0 86.0 2790 15.6 82 ford mustang gl
394 44.0 4 97.0 52.0 2130 24.6 82 vw pickup
395 32.0 4 135.0 84.0 2295 11.6 82 dodge rampage
396 28.0 4 120.0 79.0 2625 18.6 82 ford ranger
397 31.0 4 119.0 82.0 2720 19.4 82 chevy s-10

398 rows × 8 columns

In [7]:
# Check for null values
df['horsepower'].isna().sum() 
Out[7]:
0
In [8]:
# Review unique values
df['horsepower'].unique()
Out[8]:
array([130.        , 165.        , 150.        , 140.        ,
       198.        , 220.        , 215.        , 225.        ,
       190.        , 170.        , 160.        ,  95.        ,
        97.        ,  85.        ,  88.        ,  46.        ,
        87.        ,  90.        , 113.        , 200.        ,
       210.        , 193.        ,  63.91642352, 100.        ,
       105.        , 175.        , 153.        , 180.        ,
       110.        ,  72.        ,  86.        ,  70.        ,
        76.        ,  65.        ,  69.        ,  60.        ,
        80.        ,  54.        , 208.        , 155.        ,
       112.        ,  92.        , 145.        , 137.        ,
       158.        , 167.        ,  94.        , 107.        ,
       230.        ,  49.        ,  75.        ,  91.        ,
       122.        ,  96.55621422,  67.        ,  83.        ,
        78.        ,  52.        ,  61.        ,  93.        ,
       148.        , 129.        ,  96.        ,  71.        ,
        98.        , 115.        ,  53.        ,  81.        ,
        79.        , 120.        , 152.        , 102.        ,
       108.        ,  68.        ,  58.        , 149.        ,
        89.        ,  63.        ,  48.        ,  66.        ,
       139.        , 103.        , 125.        , 133.        ,
       138.        , 135.        , 142.        ,  77.        ,
        57.85814468,  62.        , 132.        , 102.00825115,
        84.        ,  64.        ,  74.        ,  76.40523288,
       116.        ,  75.53953501,  82.        ])
Decision Point¶

The imputed values carry trailing decimals that are distracting for display purposes. Validating this feature, however, will require domain experts. Moreover, vintage classification is not primarily determined by horsepower or any other performance feature (see above). It therefore seems better to leave these records as entries flagged by their trailing decimals and allow domain experts to review them.
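Since only the imputed entries carry fractional parts, those flagged records can be isolated for expert review with a simple modulo check. A minimal sketch on toy values (mixing original integer-valued horsepower with imputed ones):

```python
import pandas as pd

# Toy horsepower column: original integer-valued entries plus
# regression-imputed values that carry trailing decimals
hp = pd.Series([130.0, 96.55621422, 75.0, 63.91642352])

# Keep only values with a nonzero fractional part, i.e. the imputed ones
flagged = hp[hp % 1 != 0]
print(flagged.index.tolist())
```

On the notebook's df, the same filter, df[df['horsepower'] % 1 != 0], would surface the six imputed rows.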

In [9]:
# Create an index column for visualization traceability
df['index'] = df.index
In [10]:
# Review preprocessed data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    float64
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model year    398 non-null    int64  
 7   car name      398 non-null    object 
 8   index         398 non-null    int64  
dtypes: float64(4), int64(4), object(1)
memory usage: 28.1+ KB

Observations:¶

  • Make and model, as discussed above, can be significant in determining a vehicle's vintage classification
  • Most (88.7%) of the 'car name' values occur only once (designated as "other" in the frequency distribution) [2], which can be the very reason a vehicle is classified as "vintage"
  • Therefore, these 'car name' labels should be further reviewed, and disaggregated if possible

Exploratory Data Analysis and Hypothesis Generation¶

t-SNE Preview¶

In [11]:
# t-SNE analysis to review car name

from sklearn.manifold import TSNE
import plotly.express as px
import warnings

# Future warning suppressed
warnings.filterwarnings('ignore', category=FutureWarning)


# Prepare t-SNE data
features = df.drop('car name', axis=1)  
tsne = TSNE(n_components=2, random_state=42, perplexity=min(30, len(features)-1))
features_tsne = tsne.fit_transform(features)

# Create DataFrame
tsne_df = pd.DataFrame(features_tsne, columns=['t-SNE 1', 't-SNE 2'])
tsne_df['car name'] = df['car name']

# Plotly scatter colored by car name
fig = px.scatter(tsne_df, x='t-SNE 1', y='t-SNE 2', 
                 color='car name', 
                 hover_data=['car name'],
                 title='t-SNE Visualization of car name')

fig.update_layout(
   title='t-SNE Visualization of Car Name',
   autosize=False,
   width=800,
   height=300,
   xaxis=dict(
       title='',
       showticklabels=False, 
       showgrid=False, 
       zeroline=False
   ),
   yaxis=dict(
       title='',
       showticklabels=False, 
       showgrid=False, 
       zeroline=False
   )
)

fig.show()
In [12]:
# t-SNE to review only "Other" in car name

# Identify car names with a frequency of 1
singular_names = df['car name'].value_counts()[df['car name'].value_counts() == 1].index

# Filter the dataset to include only those car names
singular_df = df[df['car name'].isin(singular_names)]

# Drop 'car name' for t-SNE, use other features for singular entries only
features_singular = singular_df.drop('car name', axis=1)

# Apply t-SNE to singular entries
tsne_singular = TSNE(n_components=2, random_state=42, perplexity=min(30, len(features_singular)-1))
features_tsne_singular = tsne_singular.fit_transform(features_singular)

# Create DataFrame for Plotly
tsne_singular_df = pd.DataFrame(features_tsne_singular, columns=['t-SNE 1', 't-SNE 2'])
tsne_singular_df['car name'] = singular_df['car name'].values

import plotly.express as px

# Plotly visualization
fig = px.scatter(tsne_singular_df, x='t-SNE 1', y='t-SNE 2', 
                 color='car name', 
                 hover_data=['car name'],  # Hover over points to see car name
                 title='t-SNE Visualization of Singular Frequency Car Names')

# Update layout for cleaner visualization
fig.update_layout(
    autosize=False,
    width=800,
    height=300,
    xaxis=dict(title='',showticklabels=False, showgrid=False, zeroline=False),
    yaxis=dict(title='',showticklabels=False, showgrid=False, zeroline=False),
    title='t-SNE Visualization of "Other" (Rare) Car Names'
)

fig.show()

Observations:¶

The data points are widely scattered throughout the t-SNE plots, indicating little clear structure or clustering, which suggests the need for additional feature refinement or alternative dimensionality reduction techniques to better capture relationships.

  • Although t-SNE visualized clusters by association with vehicle performance features, the car name labels themselves show that:
    1. The same make can be designated inconsistently, e.g. Volkswagen abbreviated as "vw"
    2. The car name field can be made more meaningful by extracting its first term
    3. Further grouping of car names may be performed using fuzzy matching as needed

Feature Engineering: Car Brand Review and Clean-up¶

In [13]:
# Extract the first word from 'car name'
df['car_brand'] = df['car name'].str.split().str[0]

# View the unique brands
print(df['car_brand'].unique())
['chevrolet' 'buick' 'plymouth' 'amc' 'ford' 'pontiac' 'dodge' 'toyota'
 'datsun' 'volkswagen' 'peugeot' 'audi' 'saab' 'bmw' 'chevy' 'hi'
 'mercury' 'opel' 'fiat' 'oldsmobile' 'chrysler' 'mazda' 'volvo' 'renault'
 'toyouta' 'maxda' 'honda' 'subaru' 'chevroelt' 'capri' 'vw'
 'mercedes-benz' 'cadillac' 'mercedes' 'vokswagen' 'triumph' 'nissan']
Observation¶
  • Many brand names are misspelled, so fuzzy matching may be helpful
In [14]:
# Selective review of extracted car brands
df[df['car_brand'] == 'hi']
Out[14]:
mpg cylinders displacement horsepower weight acceleration model year car name index car_brand
28 9.0 8 304.0 193.0 4732 18.5 70 hi 1200d 28 hi
Observations¶
  • Internet search results for "1970 1200d" revealed that this is a vintage International Harvester: "Overall, the 1970 International Harvester 1200D is a rare and sought-after vintage pickup truck, prized for its ruggedness, reliability, and nostalgic appeal."
  • This confirms that 'car name' could be valuable in the Vintage Classification of the dataset
Decision Point¶
  • "hi" will be renamed "harvester", while the car name remains as entered in the dataset
In [15]:
# Rename "hi" to "harvester"
df.loc[df['car_brand'] == 'hi', 'car_brand'] = 'harvester'
In [16]:
# Confirm update 
df[df['car_brand'] == 'harvester']
Out[16]:
mpg cylinders displacement horsepower weight acceleration model year car name index car_brand
28 9.0 8 304.0 193.0 4732 18.5 70 hi 1200d 28 harvester

Clean up car brand with mapping¶

In [17]:
# Cleanup typos with mapping of known inconsistencies

brand_mapping = {
    'chevy': 'chevrolet',
    'chevroelt': 'chevrolet',
    'vw': 'volkswagen',
    'vokswagen': 'volkswagen',
    'toyouta': 'toyota',
    'maxda': 'mazda',
    'mercedes': 'mercedes-benz'
}

# Apply the mapping to standardize the brands
df['car_brand'] = df['car_brand'].replace(brand_mapping)
In [18]:
# Selective review of extracted car brands (capri vs ford vs mercury): Capri produced by Mercury, owned by Ford
print(df[df['car_brand'] == 'ford'][['model year', 'car name', 'index', 'car_brand']], "\n")
print(df[df['car_brand'] == 'mercury'][['model year', 'car name', 'index', 'car_brand']], "\n")
print(df[df['car_brand'] == 'capri'][['model year', 'car name', 'index', 'car_brand']], "\n")
     model year                  car name  index car_brand
4            70               ford torino      4      ford
5            70          ford galaxie 500      5      ford
17           70             ford maverick     17      ford
25           70                 ford f250     25      ford
32           71                ford pinto     32      ford
36           71           ford torino 500     36      ford
40           71          ford galaxie 500     40      ford
43           71  ford country squire (sw)     43      ford
48           71              ford mustang     48      ford
61           72       ford pinto runabout     61      ford
65           72          ford galaxie 500     65      ford
74           72     ford gran torino (sw)     74      ford
80           72           ford pinto (sw)     80      ford
88           73          ford gran torino     88      ford
92           73                  ford ltd     92      ford
100          73             ford maverick    100      ford
104          73              ford country    104      ford
112          73                ford pinto    112      ford
126          74             ford maverick    126      ford
130          74                ford pinto    130      ford
136          74          ford gran torino    136      ford
139          74     ford gran torino (sw)    139      ford
155          75             ford maverick    155      ford
159          75                  ford ltd    159      ford
166          75           ford mustang ii    166      ford
168          75                ford pinto    168      ford
174          75                ford pinto    174      ford
190          76          ford gran torino    190      ford
193          76             ford maverick    193      ford
200          76         ford granada ghia    200      ford
206          76                ford pinto    206      ford
214          76                 ford f108    214      ford
228          77              ford granada    228      ford
232          77          ford thunderbird    232      ford
236          77       ford mustang ii 2+2    236      ford
245          78               ford fiesta    245      ford
254          78      ford fairmont (auto)    254      ford
255          78       ford fairmont (man)    255      ford
264          78               ford futura    264      ford
282          79           ford fairmont 4    282      ford
286          79           ford ltd landau    286      ford
290          79  ford country squire (sw)    290      ford
314          80             ford fairmont    314      ford
336          80        ford mustang cobra    336      ford
351          81            ford escort 4w    351      ford
352          81            ford escort 2h    352      ford
365          81           ford granada gl    365      ford
373          82      ford fairmont futura    373      ford
389          82            ford granada l    389      ford
393          82           ford mustang gl    393      ford
396          82               ford ranger    396      ford 

     model year                  car name  index car_brand
49           71        mercury capri 2000     49   mercury
67           72           mercury marquis     67   mercury
90           73  mercury marquis brougham     90   mercury
113          73          mercury capri v6    113   mercury
154          75           mercury monarch    154   mercury
224          77   mercury cougar brougham    224   mercury
251          78      mercury monarch ghia    251   mercury
259          78            mercury zephyr    259   mercury
281          79          mercury zephyr 6    281   mercury
287          79     mercury grand marquis    287   mercury
379          82            mercury lynx l    379   mercury 

     model year  car name  index car_brand
184          76  capri ii    184     capri 

Decision Point¶
  • "capri ii", extracted with a "capri" car brand, is better classified under the "mercury" brand, as querying it alongside the other Mercury Capri vehicles in the dataset shows
  • Thus, its brand will be replaced with "mercury", while "capri ii" will remain as its car name
In [19]:
# Rename "capri" car brand to "mercury"
df.loc[df['car_brand'] == 'capri', 'car_brand'] = 'mercury'

# Confirm updated car brand
df[df['car_brand'] == 'mercury']
Out[19]:
mpg cylinders displacement horsepower weight acceleration model year car name index car_brand
49 23.0 4 122.0 86.0 2220 14.0 71 mercury capri 2000 49 mercury
67 11.0 8 429.0 208.0 4633 11.0 72 mercury marquis 67 mercury
90 12.0 8 429.0 198.0 4952 11.5 73 mercury marquis brougham 90 mercury
113 21.0 6 155.0 107.0 2472 14.0 73 mercury capri v6 113 mercury
154 15.0 6 250.0 72.0 3432 21.0 75 mercury monarch 154 mercury
184 25.0 4 140.0 92.0 2572 14.9 76 capri ii 184 mercury
224 15.0 8 302.0 130.0 4295 14.9 77 mercury cougar brougham 224 mercury
251 20.2 8 302.0 139.0 3570 12.8 78 mercury monarch ghia 251 mercury
259 20.8 6 200.0 85.0 3070 16.7 78 mercury zephyr 259 mercury
281 19.8 6 200.0 85.0 2990 18.2 79 mercury zephyr 6 281 mercury
287 16.5 8 351.0 138.0 3955 13.2 79 mercury grand marquis 287 mercury
379 36.0 4 98.0 70.0 2125 17.3 82 mercury lynx l 379 mercury
Observation¶
  • The "capri ii" is now categorized under the "mercury" car brand. However, further preprocessing of the car name field to extract explicit models (e.g., a "car_model" column) would enhance subsequent reviews.
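The model-extraction idea could be sketched by splitting off everything after the first word. Note that "car_model" is a hypothetical helper column, and that for entries whose brand was fixed by hand (like "capri ii") the first word is not actually the brand, so a mapping-aware version would be needed in practice:

```python
import pandas as pd

# Toy frame; the notebook's df['car name'] has the same shape
toy = pd.DataFrame({'car name': ['mercury capri 2000', 'ford torino', 'capri ii']})

# Split on the first space only: word 0 is (usually) the brand,
# the remainder is taken as the model
toy['car_model'] = toy['car name'].str.split(n=1).str[1]
print(toy['car_model'].tolist())
```

The "capri ii" row illustrates the caveat: its extracted "model" is just "ii", since "capri" was the first word.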
In [20]:
# Review unique brands
print(df['car_brand'].unique())
['chevrolet' 'buick' 'plymouth' 'amc' 'ford' 'pontiac' 'dodge' 'toyota'
 'datsun' 'volkswagen' 'peugeot' 'audi' 'saab' 'bmw' 'harvester' 'mercury'
 'opel' 'fiat' 'oldsmobile' 'chrysler' 'mazda' 'volvo' 'renault' 'honda'
 'subaru' 'mercedes-benz' 'cadillac' 'triumph' 'nissan']
Observation¶
  • The list of car brands is now consistent and succinct
  • A fuzzy match will ensure there are no surprises between car_brand and car name
In [21]:
from fuzzywuzzy import fuzz

# Function to calculate fuzzy match score
def calculate_fuzzy_score(row):
    return fuzz.ratio(row['car name'], row['car_brand'])

# Create a new DataFrame for testing
df_fuzzy_test = df[['car name', 'car_brand']].copy()
df_fuzzy_test['fuzzy_score'] = df_fuzzy_test.apply(calculate_fuzzy_score, axis=1)

# View potential mismatches
df_fuzzy_test_sorted = df_fuzzy_test.sort_values(by='fuzzy_score', ascending=True)

# Show mismatches for inspection
print(df_fuzzy_test_sorted.head(20))  # Display the lowest scores
                            car name   car_brand  fuzzy_score
28                          hi 1200d   harvester           12
45        amc hornet sportabout (sw)         amc           21
394                        vw pickup  volkswagen           21
293                 vw rabbit custom  volkswagen           23
96           amc ambassador brougham         amc           23
325             vw rabbit c (diesel)  volkswagen           27
184                         capri ii     mercury           27
263  buick regal sport coupe (turbo)       buick           28
43          ford country squire (sw)        ford           29
9                 amc ambassador dpl         amc           29
66                amc ambassador sst         amc           29
290         ford country squire (sw)        ford           29
326               vw dasher (diesel)  volkswagen           29
295      dodge colt hatchback custom       dodge           31
140                 amc matador (sw)         amc           32
72                  amc matador (sw)         amc           32
197                        vw rabbit  volkswagen           32
139            ford gran torino (sw)        ford           32
283                 amc concord dl 6         amc           32
309                        vw rabbit  volkswagen           32
ObservationΒΆ
  • The fuzzy match shows that car_brand can be reliably used for visualization and clustering analysis; the lowest scores reflect abbreviations (e.g., "vw") and model-only names rather than mislabeled brands
InΒ [22]:
# Review count of unique car brands

df['car_brand'].value_counts()
Out[22]:
car_brand
ford             51
chevrolet        47
plymouth         31
amc              28
dodge            28
toyota           26
datsun           23
volkswagen       22
buick            17
pontiac          16
honda            13
mercury          12
mazda            12
oldsmobile       10
peugeot           8
fiat              8
audi              7
chrysler          6
volvo             6
renault           5
subaru            4
opel              4
saab              4
mercedes-benz     3
bmw               2
cadillac          2
triumph           1
harvester         1
nissan            1
Name: count, dtype: int64
InΒ [23]:
# Review other single-record car brands
print(df[df['car_brand'] == 'nissan'], "\n")
print("---")
print("\n",df[df['car_brand'] == 'triumph'])
      mpg  cylinders  displacement  horsepower  weight  acceleration  \
380  36.0          4         120.0        88.0    2160          14.5   

     model year          car name  index car_brand  
380          82  nissan stanza xe    380    nissan   

---

       mpg  cylinders  displacement  horsepower  weight  acceleration  \
335  35.0          4         122.0        88.0    2500          15.1   

     model year           car name  index car_brand  
335          80  triumph tr7 coupe    335   triumph  
FindingsΒΆ

While Triumph is a single-record brand, similar to International Harvester, Nissan is the manufacturer behind the Datsun brand, so the Nissan Stanza should be reviewed to determine whether it should be reassigned to "datsun".

InΒ [24]:
df[df['car_brand'] == 'datsun']
Out[24]:
mpg cylinders displacement horsepower weight acceleration model year car name index car_brand
18 27.0 4 97.0 88.0 2130 14.5 70 datsun pl510 18 datsun
29 27.0 4 97.0 88.0 2130 14.5 71 datsun pl510 29 datsun
54 35.0 4 72.0 69.0 1613 18.0 71 datsun 1200 54 datsun
81 28.0 4 97.0 92.0 2288 17.0 72 datsun 510 (sw) 81 datsun
110 22.0 4 108.0 94.0 2379 16.5 73 datsun 610 110 datsun
129 31.0 4 79.0 67.0 1950 19.0 74 datsun b210 129 datsun
145 32.0 4 83.0 61.0 2003 19.0 74 datsun 710 145 datsun
173 24.0 4 119.0 97.0 2545 17.0 75 datsun 710 173 datsun
204 32.0 4 85.0 70.0 1990 17.0 76 datsun b-210 204 datsun
220 33.5 4 85.0 70.0 1945 16.8 77 datsun f-10 hatchback 220 datsun
241 22.0 6 146.0 97.0 2815 14.5 77 datsun 810 241 datsun
247 39.4 4 85.0 70.0 2070 18.6 78 datsun b210 gx 247 datsun
268 27.2 4 119.0 97.0 2300 14.7 78 datsun 510 268 datsun
273 23.9 4 119.0 97.0 2405 14.9 78 datsun 200-sx 273 datsun
303 31.8 4 85.0 65.0 2020 19.2 79 datsun 210 303 datsun
312 37.2 4 86.0 65.0 2019 16.4 80 datsun 310 312 datsun
320 37.0 4 119.0 92.0 2434 15.0 80 datsun 510 hatchback 320 datsun
324 40.8 4 85.0 65.0 2110 19.2 80 datsun 210 324 datsun
333 32.7 6 168.0 132.0 2910 11.4 80 datsun 280-zx 333 datsun
347 37.0 4 85.0 65.0 1975 19.4 81 datsun 210 mpg 347 datsun
357 32.9 4 119.0 100.0 2615 14.8 81 datsun 200sx 357 datsun
362 24.2 6 146.0 120.0 2930 13.8 81 datsun 810 maxima 362 datsun
385 38.0 4 91.0 67.0 1995 16.2 82 datsun 310 gx 385 datsun
Decision PointΒΆ

There are no "stanza" models in the "datsun"-filtered car brand dataset above. According to an internet search, Nissan vehicles were sold under the Datsun brand until 1983, when the Datsun brand was discontinued and the Stanza XE was rebranded as the Nissan Stanza XE. Therefore, the single "nissan" record will remain unchanged in this analysis.

t-SNE_reviewΒΆ

InΒ [25]:
from sklearn.preprocessing import LabelEncoder
from sklearn.manifold import TSNE
import plotly.express as px
import warnings

# Suppress selected Future Warning
warnings.filterwarnings(
    "ignore",
    message="When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas.*",
    category=FutureWarning
)


# Encode categorical car brands
le = LabelEncoder()
df['car_brand_encoded'] = le.fit_transform(df['car_brand'])

# Drop unnecessary columns and prepare features
features = df.drop(['car name', 'car_brand'], axis=1)

# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=min(30, len(features)-1))
features_tsne = tsne.fit_transform(features)

# Create a DataFrame for visualization
tsne_df = pd.DataFrame(features_tsne, columns=['t-SNE 1', 't-SNE 2'])
tsne_df['car_brand'] = df['car_brand']
tsne_df['car_brand_encoded'] = df['car_brand_encoded']

# Plot the t-SNE visualization
fig = px.scatter(tsne_df, x='t-SNE 1', y='t-SNE 2', 
                 color='car_brand', 
                 hover_data=['car_brand'])

# Update layout for cleaner visualization
fig.update_layout(
    autosize=False,
    width=800,
    height=300,
    xaxis=dict(title='',showticklabels=False, showgrid=False, zeroline=False),
    yaxis=dict(title='',showticklabels=False, showgrid=False, zeroline=False),
    title='t-SNE Visualization of Car Brands'
    )

fig.show()

ObservationΒΆ

Clustering based on car brand and vintage classification still seems to be obscured by the performance features. It therefore seems reasonable that removing these features will yield more coherent groupings (as they pertain to vintage classification) than applying PCA for dimensionality reduction.

Hypothesis: Vintage ClassificationΒΆ

  • Vintage classification is primarily determined by the interaction between car name and model year.

Null Hypothesis (𝐻0):

  • The interaction between car name and model year sufficiently explains vintage classification, without additional feature transformations (such as applying PCA to the performance features).

  • Implication:

    • Clustering without PCA (using car name and model year along with other original features) will yield clusters of similar quality as clustering after PCA.

    • PCA will not significantly improve cluster separability or performance because the most relevant variance is already captured by car name and model year.

Alternative Hypothesis (π»π‘Ž):

  • PCA will reveal latent features that significantly improve vintage classification beyond the interaction of car name and model year alone.

  • Implication:

    • Clustering after PCA will produce significantly better-defined clusters (e.g., higher silhouette scores, better cluster separability).
    • Other features or combinations of features (e.g., displacement, acceleration, or transformed horsepower) contain important latent information relevant to vintage classification.

KMeans without vs with Performance Features vs PCAΒΆ

InΒ [26]:
# Begin with only car brand and model year for 3-cluster scoring using Silhouette Scores
# "Core": model year + car brand only
# "Full": Core plus non-performance metrics (weight, displacement)
# "PCA": the Full feature set reduced with PCA

from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import warnings

# Suppress specific KMeans warning
warnings.filterwarnings("ignore", message="KMeans is known to have a memory leak on Windows with MKL")

# Label encode 'car_brand'
le = LabelEncoder()
df['car_brand_encoded'] = le.fit_transform(df['car_brand'])

# Core
# Drop performance features and keep 'car_brand_encoded'
features_core = df[['model year', 'car_brand_encoded']]

# Fit KMeans 
kmeans_core = KMeans(n_clusters=3, random_state=42, n_init=10)
labels_core = kmeans_core.fit_predict(features_core)

# Evaluate cluster quality
silhouette_core = silhouette_score(features_core, labels_core)
print(f'Silhouette Score (Core Features - Null Hypothesis): {silhouette_core}')

# Full: add non-performance metrics (weight, displacement) to the core features
features_full = df[['model year', 'car_brand_encoded', 'weight', 'displacement']]

# Fit KMeans
kmeans_full = KMeans(n_clusters=3, random_state=42, n_init=10)
labels_full = kmeans_full.fit_predict(features_full)

# Evaluate cluster quality
silhouette_full = silhouette_score(features_full, labels_full)
print(f'Silhouette Score (Full Features with Non-Performance Metrics): {silhouette_full}')

# Compare alternative hypothesis with PCA + KMeans
from sklearn.decomposition import PCA

# Apply PCA to features before running KMeans
pca = PCA(n_components=2)  # Reduce to 2 dimensions
# features_pca = pca.fit_transform(features_core)  # yields silhouette ~0.4035, same as Core
features_pca = pca.fit_transform(features_full)

# Fit KMeans on PCA-reduced features
kmeans_pca = KMeans(n_clusters=3, random_state=42, n_init=10)
labels_pca = kmeans_pca.fit_predict(features_pca)

# Evaluate cluster quality
silhouette_pca = silhouette_score(features_pca, labels_pca)
print(f'Silhouette Score (PCA Features - Alternative Hypothesis): {silhouette_pca}')
Silhouette Score (Core Features - Null Hypothesis): 0.40354677446266135
Silhouette Score (Full Features with Non-Performance Metrics): 0.5871997958450826
Silhouette Score (PCA Features - Alternative Hypothesis): 0.587706584336502
Silhouette ScoreΒΆ
  • Evaluation of the quality of clustering as represented by the Silhouette Score:

    1. Cluster Validity: A higher score indicates that clusters are well-formed and distinct.
    2. Optimal Number of Clusters: The silhouette score can help reveal and identify the ideal number of clusters.
    3. Cluster Interpretability: Assess whether the clustering reflects meaningful patterns in the data (e.g., does it align with vintage classification?).
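As a quick illustration of points 1 and 2, a toy run on synthetic blobs (hypothetical data, not this notebook's df) shows the score rewarding the correct cluster count:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs: the true k should score highest
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

# Score the correct cluster count vs a deliberately wrong one
score_k3 = silhouette_score(X, KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X))
score_k7 = silhouette_score(X, KMeans(n_clusters=7, random_state=42, n_init=10).fit_predict(X))
print(f'k=3: {score_k3:.3f}, k=7: {score_k7:.3f}')
```

Over-segmenting (k=7) splits coherent blobs and drops the score, which is the behavior used throughout the comparisons below.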

Findings

  1. Core Features (Null Hypothesis):
  • Silhouette Score: 0.4035
  • This reflects a moderate clustering quality, suggesting that while model year and car brand alone can form clusters, they lack the nuance to capture fully distinct vintage classifications.
  2. Full Features (Non-Performance Metrics):
  • Silhouette Score: 0.5872
  • A substantial improvement, showing that adding weight and displacement leads to more cohesive and well-separated clusters.
  • These features seem to carry valuable information about vehicle design or era-related trends, which help refine vintage classification.
  3. PCA on Full Features (Alternative Hypothesis):
  • Silhouette Score: 0.5877
  • Almost identical to the Full Features model without PCA.
  • This suggests that PCA successfully reduces dimensionality while preserving important patterns, but it doesn't improve cluster separability significantly beyond the raw features.
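One way to check the "PCA preserves the important patterns" claim is to inspect explained_variance_ratio_. The sketch below uses synthetic stand-in data (not this notebook's features_full), since it illustrates the check rather than reproduces the result:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for a four-column feature matrix whose columns are
# strongly correlated (e.g., weight and displacement tracking each other)
rng = np.random.default_rng(42)
base = rng.normal(size=(300, 2))
X = np.column_stack([base[:, 0], base[:, 1],
                     3 * base[:, 0] + rng.normal(scale=0.1, size=300),
                     2 * base[:, 1] + rng.normal(scale=0.1, size=300)])

# Two components should retain nearly all of the variance here
pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_, pca.explained_variance_ratio_.sum())
```

When the first two ratios sum close to 1, dropping the remaining components loses little information, which is consistent with the near-identical silhouette scores above.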
Next StepΒΆ
  • Compare results with scaling
InΒ [27]:
# Scaling all 3 models to compare

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Initialize StandardScaler
scaler = StandardScaler()

# Scale Core Features
features_core_scaled = scaler.fit_transform(features_core)

# Scale Full Features
features_full_scaled = scaler.fit_transform(features_full)

# Step 1: PCA on Scaled Full Features
pca = PCA(n_components=3)  
features_pca_scaled = pca.fit_transform(features_full_scaled)

# Step 2: Fit KMeans and Evaluate Silhouette Scores
print('Scaled Silhouette Scoring:')
# Core Features Model
kmeans_core = KMeans(n_clusters=3, random_state=42, n_init=10)
labels_core = kmeans_core.fit_predict(features_core_scaled)
silhouette_core = silhouette_score(features_core_scaled, labels_core)
print(f'Silhouette Score (Core Features - Scaled): {silhouette_core}')

# Full Features Model
kmeans_full = KMeans(n_clusters=3, random_state=42, n_init=10)
labels_full = kmeans_full.fit_predict(features_full_scaled)
silhouette_full = silhouette_score(features_full_scaled, labels_full)
print(f'Silhouette Score (Full Features - Scaled): {silhouette_full}')

# PCA Features Model
kmeans_pca = KMeans(n_clusters=2, random_state=42, n_init=10)
labels_pca = kmeans_pca.fit_predict(features_pca_scaled)
silhouette_pca = silhouette_score(features_pca_scaled, labels_pca)
print(f'Silhouette Score (PCA Features - Scaled): {silhouette_pca}')
Scaled Silhouette Scoring:
Silhouette Score (Core Features - Scaled): 0.42145239754250424
Silhouette Score (Full Features - Scaled): 0.33382797545291965
Silhouette Score (PCA Features - Scaled): 0.3916454191001744
FindingΒΆ
  • Core Features show the highest silhouette score when compared to full features with and without PCA
InΒ [28]:
# Visualize the 3 models together

import plotly.express as px
from sklearn.manifold import TSNE
import pandas as pd
import warnings

# Suppress the FutureWarning from Plotly
warnings.filterwarnings("ignore", message="When grouping with a length-1 list-like")

# Decode 'car_brand_encoded' back to original 'car_brand'
df['car_brand'] = le.inverse_transform(df['car_brand_encoded'])

# t-SNE for Core Features
tsne_core = TSNE(n_components=2, random_state=42, perplexity=30)
features_tsne_core = tsne_core.fit_transform(features_core)
tsne_df_core = pd.DataFrame(features_tsne_core, columns=['t-SNE 1', 't-SNE 2'])
tsne_df_core['car brand'] = df.loc[features_core.index, 'car_brand']
tsne_df_core['cluster'] = labels_core
tsne_df_core['model'] = 'Core Features'
tsne_df_core['index'] = features_core.index
tsne_df_core['car name'] = df.loc[features_core.index, 'car name']
tsne_df_core['model year'] = df.loc[features_core.index, 'model year']

# t-SNE for Full Features
tsne_full = TSNE(n_components=2, random_state=42, perplexity=30)
features_tsne_full = tsne_full.fit_transform(features_full)
tsne_df_full = pd.DataFrame(features_tsne_full, columns=['t-SNE 1', 't-SNE 2'])
tsne_df_full['car brand'] = df.loc[features_full.index, 'car_brand']
tsne_df_full['cluster'] = labels_full
tsne_df_full['model'] = 'Full Features'
tsne_df_full['index'] = features_full.index
tsne_df_full['car name'] = df.loc[features_full.index, 'car name']
tsne_df_full['model year'] = df.loc[features_full.index, 'model year']

# t-SNE for PCA Features
tsne_pca = TSNE(n_components=2, random_state=42, perplexity=30)
features_tsne_pca = tsne_pca.fit_transform(features_pca)
tsne_df_pca = pd.DataFrame(features_tsne_pca, columns=['t-SNE 1', 't-SNE 2'])
tsne_df_pca['car brand'] = df.loc[features_full.index, 'car_brand']
tsne_df_pca['cluster'] = labels_pca
tsne_df_pca['model'] = 'PCA Features'
tsne_df_pca['index'] = features_full.index
tsne_df_pca['car name'] = df.loc[features_full.index, 'car name']
tsne_df_pca['model year'] = df.loc[features_full.index, 'model year']

# Combine t-SNE DataFrames
tsne_combined = pd.concat([tsne_df_core, tsne_df_full, tsne_df_pca])

# Plot with facet by model and custom hover
fig = px.scatter(tsne_combined, 
                 x='t-SNE 1', 
                 y='t-SNE 2', 
                 color='cluster', 
                 facet_col='model', 
                 title='t-SNE Comparison of Core vs Full vs Full with PCA')

# Customize hovertemplate
fig.update_traces(
    hovertemplate='<br>Car Brand: %{customdata[0]}<br>Car Name: %{customdata[1]}<br>Model Year: %{customdata[2]}<br>Cluster: %{customdata[3]}<extra></extra>',
    customdata=tsne_combined[['car brand', 'car name', 'model year', 'cluster']].to_numpy()
)

# Suppress t-SNE axis labels for all facets
fig.update_xaxes(title='', showticklabels=False, showgrid=False, zeroline=False, matches='x')
fig.update_yaxes(title='', showticklabels=False, showgrid=False, zeroline=False, matches='y')

# Control dimensions
fig.update_layout(
    autosize=False,
    width=800,
    height=400,
    coloraxis_showscale=False 
)

fig.show()
General ObservationsΒΆ
  • PCA and Full Features without PCA look very much alike; the shape of these models also resembles the original t-SNE preview and review visualizations
  • The Core Features model reflects clear, distinct clusters with apparent separation between them, suggesting meaningful differences between data points
  • An evaluation of the optimal number of clusters should be examined next

ElbowPlotsΒΆ

InΒ [29]:
# Review optimal clusters with Elbow Plot

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Define a function to create the elbow plot
def plot_elbow(features, title):
    inertia_values = []
    cluster_range = range(1, 11)  # Test 1 to 10 clusters

    for k in cluster_range:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        kmeans.fit(features)
        inertia_values.append(kmeans.inertia_)

    # Plot the elbow plot
    plt.figure(figsize=(4, 3))
    plt.plot(cluster_range, inertia_values, marker='o', linestyle='--')
    plt.title(f'{title}', fontsize=10)
    plt.xlabel('Number of Clusters (k)', fontsize=8)
    plt.ylabel('Inertia (WCSS)', fontsize=8)
    plt.yticks(fontsize=8)
    plt.grid(True)
    plt.show()

# Elbow Plot for Core Features (Only Car Brand and Model Year)
plot_elbow(features_core, 'Core Features (Car Brand + Model Year)')

# Elbow Plot for Full Features (Model Year + Car Brand + Non-Performance Metrics)
plot_elbow(features_full, 'Full Features')

# Elbow Plot for PCA Features (Dimensionality Reduced Features)
plot_elbow(features_pca, 'PCA Features (Full Features Reduced to 2D)')
[Elbow plots for Core Features, Full Features, and PCA Features]
FindingsΒΆ
  • All 3 models agree that k = 2 is the optimum number of clusters
  • Re-run KMeans where k = 2 for all 3 models
  • Also re-run as k=2 to compare scaling for all 3 models

Clustering iterations with 2 ClustersΒΆ

InΒ [30]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import warnings

# Suppress specific KMeans warning
warnings.filterwarnings("ignore", message="KMeans is known to have a memory leak on Windows with MKL")

def preprocess_data(df):
    """Encodes categorical variables and prepares data for clustering."""
    # Label encode 'car_brand'
    le = LabelEncoder()
    df['car_brand_encoded'] = le.fit_transform(df['car_brand'])
    return df

def get_features(df):
    """Define the feature sets for clustering."""
    # Core features: minimal set
    features_core = df[['model year', 'car_brand_encoded']]
    # Full features: include additional metrics
    features_full = df[['model year', 'car_brand_encoded', 'weight', 'displacement']]
    return features_core, features_full

def calculate_silhouette_scores(features_core, features_full):
    """Calculates silhouette scores for different feature sets, scaling, and cluster counts."""
    results = []
    scaler = StandardScaler()

    # Core Features: Scaled and Unscaled
    features_core_scaled = scaler.fit_transform(features_core)
    for scaling, features in zip(["Unscaled", "Scaled"], [features_core, features_core_scaled]):
        for k in [2, 3]:
            kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
            labels = kmeans.fit_predict(features)
            silhouette = silhouette_score(features, labels)
            results.append({
                "Feature Set": "Core Features",
                "Scaling": scaling,
                "Clusters": k,
                "Silhouette Score": silhouette
            })

    # Full Features: Scaled and Unscaled
    features_full_scaled = scaler.fit_transform(features_full)
    for scaling, features in zip(["Unscaled", "Scaled"], [features_full, features_full_scaled]):
        for k in [2, 3]:
            kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
            labels = kmeans.fit_predict(features)
            silhouette = silhouette_score(features, labels)
            results.append({
                "Feature Set": "Full Features",
                "Scaling": scaling,
                "Clusters": k,
                "Silhouette Score": silhouette
            })

    # PCA Features: Scaled and Unscaled
    for scaling, features in zip(["Unscaled", "Scaled"], [features_full, features_full_scaled]):
        pca = PCA(n_components=2)
        features_pca = pca.fit_transform(features)
        for k in [2, 3]:
            kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
            labels = kmeans.fit_predict(features_pca)
            silhouette = silhouette_score(features_pca, labels)
            results.append({
                "Feature Set": "PCA Features",
                "Scaling": scaling,
                "Clusters": k,
                "Silhouette Score": silhouette
            })

    return pd.DataFrame(results)

def print_results(df_silhouette):
    """Prints silhouette scores in the desired format."""
    for scaling in ["Unscaled", "Scaled"]:
        print(f"2-Cluster, {scaling} Silhouette Scoring:")
        filtered = df_silhouette[(df_silhouette["Clusters"] == 2) & (df_silhouette["Scaling"] == scaling)]
        for _, row in filtered.iterrows():
            print(f"Silhouette Score ({row['Feature Set']} - {scaling}): {row['Silhouette Score']}")
        print()

def main(df):
    # Step 1: Preprocess Data
    df = preprocess_data(df)

    # Step 2: Get Feature Sets
    features_core, features_full = get_features(df)

    # Step 3: Calculate Silhouette Scores
    df_silhouette = calculate_silhouette_scores(features_core, features_full)

    # Step 4: Print Results
    print_results(df_silhouette)

    # Step 5: Return DataFrame for further analysis
    return df_silhouette

# Run the models: Full vs Core, Scaled vs Unscaled, Produce corresponding silhouette scores
df_silhouette = main(df)
2-Cluster, Unscaled Silhouette Scoring:
Silhouette Score (Core Features - Unscaled): 0.5565510187158499
Silhouette Score (Full Features - Unscaled): 0.6355437404585835
Silhouette Score (PCA Features - Unscaled): 0.6357939678511504

2-Cluster, Scaled Silhouette Scoring:
Silhouette Score (Core Features - Scaled): 0.39066681853629537
Silhouette Score (Full Features - Scaled): 0.3851821168576777
Silhouette Score (PCA Features - Scaled): 0.4796259476891685

Statistical Benchmarking and Cross-Context ValidationΒΆ

  • While silhouette scores are meaningful within their immediate contexts for comparison (e.g., 2-cluster vs 3-cluster, scaled vs unscaled), statistical comparisons of the clustering models allow us to compare across models more meaningfully
  • Coefficient of Variation (CV) and Relative Range can be applied as key metrics to provide a global measure of clustering performance and stability.
  • Although the elbow plots of all three models unanimously identified 2 clusters as optimal, the elbow point, like the silhouette score, is not an absolute metric.
    • The elbow plot provides guidance rather than definitive answers in the absolute sense.
    • The results from the elbow plots must therefore be interpreted in conjunction with other metrics, such as stability and reliability metrics, as well as domain knowledge.
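Both metrics reduce to one-line formulas. This toy sketch (hypothetical score sets, not the notebook's results) shows how tighter scores yield lower CV and relative range:

```python
import numpy as np

def coefficient_of_variation(scores):
    """CV (%) = sample std / mean * 100; lower means more stable scores."""
    return np.std(scores, ddof=1) / np.mean(scores) * 100

def relative_range(scores):
    """Relative range (%) = (max - min) / max * 100."""
    return (np.max(scores) - np.min(scores)) / np.max(scores) * 100

# Two hypothetical silhouette-score sets with the same mean but different spread
stable = [0.40, 0.41, 0.42]
unstable = [0.30, 0.41, 0.52]
print(coefficient_of_variation(stable), coefficient_of_variation(unstable))
print(relative_range(stable), relative_range(unstable))
```

Because both metrics are normalized by the scores' own magnitude, they can be compared across scaled and unscaled runs, which raw silhouette scores cannot.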

Stability and Reliability MetricsΒΆ

InΒ [31]:
# Coefficient of Variation: Quantify relative variability across model scenarios

# CV by Feature Set and Scaling (Aggregate Clusters)
cv_results_feature_scaling = df_silhouette.groupby(['Feature Set', 'Scaling'])['Silhouette Score'].apply(
    lambda x: (np.std(x, ddof=1) / np.mean(x)) * 100
)
print("\nCoefficient of Variation (CV) by Feature Set and Scaling:")
print(cv_results_feature_scaling)

# CV by Feature Set and Clusters (Aggregate Scaling)
cv_results_feature_clusters = df_silhouette.groupby(['Feature Set', 'Clusters'])['Silhouette Score'].apply(
    lambda x: (np.std(x, ddof=1) / np.mean(x)) * 100
)
print("\nCoefficient of Variation (CV) by Feature Set and Clusters:")
print(cv_results_feature_clusters)

# CV by Scaling and Clusters (Aggregate Feature Set)
cv_results_scaling_clusters = df_silhouette.groupby(['Scaling', 'Clusters'])['Silhouette Score'].apply(
    lambda x: (np.std(x, ddof=1) / np.mean(x)) * 100
)
print("\nCoefficient of Variation (CV) by Scaling and Clusters:")
print(cv_results_scaling_clusters)
Coefficient of Variation (CV) by Feature Set and Scaling:
Feature Set    Scaling 
Core Features  Scaled       5.360960
               Unscaled    22.537358
Full Features  Scaled      10.100793
               Unscaled     5.591415
PCA Features   Scaled       9.242436
               Unscaled     5.558300
Name: Silhouette Score, dtype: float64

Coefficient of Variation (CV) by Feature Set and Clusters:
Feature Set    Clusters
Core Features  2           24.766815
               3            3.069382
Full Features  2           34.687551
               3           38.904567
PCA Features   2           19.800160
               3           23.408187
Name: Silhouette Score, dtype: float64

Coefficient of Variation (CV) by Scaling and Clusters:
Scaling   Clusters
Scaled    2           12.668079
          3           12.855771
Unscaled  2            7.496989
          3           20.180273
Name: Silhouette Score, dtype: float64

Findings: Coefficient of Variation: Stability / ReliabilityΒΆ

  • The CV analysis demonstrates that Full Features with 3 Clusters is a highly variable model and therefore less reliable for drawing robust insights.
  • The Core Features with 3 Clusters has the lowest CV (3.07%):
    • This indicates that the silhouette scores for this configuration are highly consistent across different contexts.
    • It also suggests that the clustering results are stable, making this model more reliable for drawing conclusions.
InΒ [32]:
# Range and Relative Range by Feature Set and Scaling

range_feature_scaling = df_silhouette.groupby(['Feature Set', 'Scaling'])['Silhouette Score'].agg(
    Range=lambda x: x.max() - x.min(),
    Relative_Range=lambda x: ((x.max() - x.min()) / x.max()) * 100 if x.max() > 0 else np.nan
)

print("\nRange and Relative Range by Feature Set and Scaling:")
print(range_feature_scaling)

# Range and Relative Range by Feature Set and Clusters
range_feature_clusters = df_silhouette.groupby(['Feature Set', 'Clusters'])['Silhouette Score'].agg(
    Range=lambda x: x.max() - x.min(),
    Relative_Range=lambda x: ((x.max() - x.min()) / x.max()) * 100 if x.max() > 0 else np.nan
)

print("\nRange and Relative Range by Feature Set and Clusters:")
print(range_feature_clusters)

# Range and Relative Range by Scaling and Clusters
range_scaling_clusters = df_silhouette.groupby(['Scaling', 'Clusters'])['Silhouette Score'].agg(
    Range=lambda x: x.max() - x.min(),
    Relative_Range=lambda x: ((x.max() - x.min()) / x.max()) * 100 if x.max() > 0 else np.nan
)

print("\nRange and Relative Range by Scaling and Clusters:")
print(range_scaling_clusters)
Range and Relative Range by Feature Set and Scaling:
                           Range  Relative_Range
Feature Set   Scaling                           
Core Features Scaled    0.030786        7.304640
              Unscaled  0.153004       27.491504
Full Features Scaled    0.051354       13.332431
              Unscaled  0.048344        7.606706
PCA Features  Scaled    0.058845       12.268954
              Unscaled  0.048087        7.563360

Range and Relative Range by Feature Set and Clusters:
                           Range  Relative_Range
Feature Set   Clusters                          
Core Features 2         0.165884       29.805749
              3         0.017906        4.248552
Full Features 2         0.250362       39.393295
              3         0.253372       43.149167
PCA Features  2         0.156168       24.562677
              3         0.166926       28.402902

Range and Relative Range by Scaling and Clusters:
                      Range  Relative_Range
Scaling  Clusters                          
Scaled   2         0.094444       19.691143
         3         0.087624       20.791060
Unscaled 2         0.079243       12.463621
         3         0.184160       31.335332
Findings: Range and Relative RangeΒΆ
  • By Scaling: Core Features (Scaled) has the lowest relative range among scaled sets (7.30), indicating high stability in scaled data. Core Features (Unscaled) has the lowest relative range among unscaled sets (27.49), showing greater stability in raw form than the other unscaled feature sets.
  • By Clusters: Core Features with 3 Clusters (4.25) has the lowest relative range overall, making it the most stable cluster configuration.
  • By Scaling and Clusters: Unscaled with 2 Clusters has the lowest relative range (12.46), suggesting that unscaled features are more stable with fewer clusters.

Across all of these groupings, Core Features with 3 Clusters is the most stable and reliable configuration overall.

Visualization of FindingsΒΆ

InΒ [33]:
import seaborn as sns
import matplotlib.pyplot as plt

# Add Adjusted Silhouette Score (1 - x)
df_silhouette['Adjusted Score'] = 1 - df_silhouette['Silhouette Score']

# Create the figure and subplots
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Subplot 1: Bar Chart
sns.barplot(ax=axes[0], x='Feature Set', y='Adjusted Score', data=df_silhouette, ci=None)
axes[0].set_title('Average Reliability/Stability Score by Feature Set')
axes[0].set_ylabel('Adjusted Silhouette Score (1 - Silhouette Score)')
axes[0].set_xlabel('Feature Set')

# Subplot 2: Heatmap
heatmap_data = df_silhouette.pivot_table(
    index='Feature Set', columns='Clusters', values='Adjusted Score', aggfunc='mean'
)
sns.heatmap(heatmap_data, annot=True, fmt='.2f', cmap='coolwarm', ax=axes[1])
axes[1].set_title('Reliability/Stability Measure: Feature Set x Clusters')
axes[1].set_xlabel('Number of Clusters')
axes[1].set_ylabel('Feature Set')

# Adjust layout for better spacing
plt.tight_layout()

# Show the combined figure
plt.show()
[Bar chart of average adjusted silhouette score by feature set, and heatmap of Feature Set x Clusters]

Findings: Magnitude of variability relative to the highest silhouette score in each contextΒΆ

  • Core Features with 3 Clusters consistently show the lowest Relative Range across all groupings, further validating their stability and reliability.
  • Full Features with 3 Clusters consistently have the highest Relative Range, indicating significant instability.

Hypothesis Testing Results, Conclusion & RecommendationsΒΆ

Null Hypothesis (𝐻0):

  • The interaction between car name and model year sufficiently explains vintage classification without PCA.

Supporting Evidence:

  1. Core Features with 3 Clusters consistently outperforms other feature sets in CV (3.07%) and Relative Range (4.25%).
  2. Low silhouette-score variability confirms that adding dimensions (Full Feature Set) or reducing dimensionality (PCA) introduces noise rather than clarity.

Alternative Hypothesis (π»π‘Ž):

  • PCA reveals latent features that significantly improve vintage classification.

Counter Evidence:

  1. PCA Features with 2 Clusters improves relative variability over most setups, but at the cost of added complexity without a clear performance gain.
  2. No configuration after PCA significantly outperforms the Core Features clusters for vintage scoring.

Conclusion & Recommendations

Support for Null Hypothesis (H0):

  • Clustering based on car name and model year is sufficient for stable and reliable vintage classification.
  • Core Features with 3 Clusters demonstrates the lowest variability and the most consistent clustering behavior.

1. Conclusion: The Core Features with 3 Clusters (see Cluster Diagram and Data Table below) can be leveraged to extract valuable insights about the cars, grouping them to more effectively target the audience. Further analysis with domain experts will be necessary to achieve this.

2. Recommendations:

  • Start with the simpler Core Features, avoiding unnecessary transformations.
    • PCA may introduce latent dimensions, but for this dataset, performance improvement was marginal with greater variability in cluster quality.
    • Domain expert(s) should review the 3-Cluster report to apply labels to the dataset; these labels can then be leveraged in supervised learning (see below).
  • In collaboration with domain expert(s):
    1. Further engineer features from the dataset, e.g. extracting the vehicle's brand and model from 'car name' to enhance clustering by car_brand, year, and car_model; performance features (e.g. horsepower) should be reviewed and validated as a reliable basis for classifying "Vintage" status (see below).
    2. Further iterations in unsupervised learning may be pursued, exploring the use of multi-model ensemble learning.
    3. Supervised learning can be used in lieu of unsupervised learning if Vintage Classification relies heavily on domain expert labelling.
    4. Use of Vintage Classification can be further discussed as it pertains to binary (e.g. Vintage vs Non-Vintage) versus multiple classes (e.g. Classic vs Retro vs Modern vs Emerging Vintage).
    5. Vintage Classification can be explored as a continuous score using a regression-based approach to combine weighted factors.
    6. The model can be leveraged to target audience segmentation: Group potential buyers by their affinity for specific vintage categories.
    7. The model can be leveraged for marketing insights: Identify which features drive perceptions of vintage value.
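The feature-engineering recommendation above can be sketched as follows. This is a minimal, hypothetical example: it assumes the dataset's 'car name' column follows the common auto-mpg convention of "brand model ..." with the brand as the first token, and the sample rows are illustrative only.

```python
import pandas as pd

# Illustrative rows in the auto-mpg 'car name' format (brand first, then model)
cars = pd.DataFrame({
    'car name': ['chevrolet chevelle malibu', 'buick skylark 320', 'ford torino'],
    'yr': [70, 70, 70],
})

# Split 'car name' into a brand (first token) and model (remainder)
cars['car_brand'] = cars['car name'].str.split().str[0]
cars['car_model'] = cars['car name'].str.split(n=1).str[1]
print(cars[['car_brand', 'car_model', 'yr']])
```

In practice, brand tokens would still need cleaning with a domain expert (e.g. reconciling misspellings and abbreviations such as "chevy" vs "chevrolet") before the derived columns are used for clustering or supervised labelling.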